Abstract: We find upper bounds for the probability of underestimation
and overestimation errors in penalized likelihood context tree estimation.
The bounds are explicit and apply to processes of not necessarily finite
memory. We allow for general penalizing terms and we give conditions over
the maximal depth of the estimated trees in order to get strongly consistent
estimates. This generalizes previous results obtained in the case of
estimation of the order of a Markov chain.
1. Introduction

In this paper we obtain an exponential upper bound for the underestimation
of the context tree of a variable memory process by penalized likelihood (PL)
criteria and a sub-exponential upper bound for the overestimation event. Our
result applies to processes of not necessarily finite memory that satisfy some
continuity requirements, generalizing the bound obtained in Dorea and Zhao
(2006) for the estimation of the order of a Markov chain by similar methods
(EDC criterion).

The concept of context tree was first introduced by Rissanen (1983) to denote 
the minimum set of sequences that are necessary to predict the next symbol in 
a finite memory stochastic chain. A particular case of context tree is the set 
of all sequences of length k, representing a Markov chain of order k. For that
reason, context trees allow a more detailed and parsimonious representation of 
processes than finite order Markov chains do. 

In the statistical literature, the processes allowing a context tree representation
are called Variable Length Markov Chains (Bühlmann and Wyner; 1999).

*This work is part of PRONEX/FAPESP's project Stochastic behavior, critical phenom- 
ena and rhythmic pattern identification in natural languages (grant number 03/09930-9), 
CNRS-FAPESP project Probabilistic phonology of rhythm and CNPq project Rhythmic pat- 
terns, prosodic domains and probabilistic modeling in Portuguese Corpora (grant number 
485999/2007-2). 
This class of models has proven useful in real data modeling, for example
in the classification of proteins into families (Bejerano and Yona; 2001;
Leonardi; 2006).

Historically, the estimation of the context tree of a process has been
addressed by different versions of the algorithm Context, introduced by Rissanen
in his seminal paper. This algorithm was proven to be weakly consistent in the
case of bounded memory (Bühlmann and Wyner; 1999) and also in the case of
unbounded memory (Ferrari and Wyner; 2003; Duarte et al.; 2006). Recently, an
upper bound for the rate of convergence of the algorithm Context in the case of
bounded memory processes was obtained in Galves et al. (2008). A generalization
of this result to the case of unbounded memory processes was given in
Galves and Leonardi (2008).

The estimation of context trees by PL criteria had not been addressed in the
literature until the recent work by Csiszár and Talata (2006). The reason for
that was the exponential cost of the estimation, due to the number of trees that
had to be considered in order to find the optimal one. In their article, Csiszár
and Talata showed that the Bayesian Information Criterion (BIC), which is a
particular case of the PL estimators (using a penalizing term growing
logarithmically), is strongly consistent and can be computed in linear time, using a
suitable version of the Context Tree Weighting method of Willems, Shtarkov
and Tjalkens (Willems et al.; 1995; Willems; 1998). Their result applies to
unbounded memory processes and the depth of the estimated tree is allowed to
grow with the sample size as a sub-logarithmic function. This last condition was
shown to be unnecessary in the case of finite memory processes in
Garivier (2006). An explicit bound on the rate of convergence of the PL context
tree estimators had remained until now an open question.

The paper is organized as follows. In Section 2 we introduce some definitions 
and state the main result. In Section 3 we present the proofs and in Section 4 we
make some final remarks. Finally, Section 5 is an appendix that contains
some results needed in our proofs and obtained elsewhere in the literature.

2. Definitions and results 

In what follows $A$ will represent a finite alphabet of size $|A|$. Given two integers
$m \le n$, we will denote by $w_m^n$ the sequence $(w_m, \dots, w_n)$ of symbols in $A$. The
length of the sequence $w_m^n$ is denoted by $\ell(w_m^n)$ and is defined by $\ell(w_m^n) =
n - m + 1$. Any sequence $w_m^n$ with $m > n$ represents the empty string and is
denoted by $\lambda$. The length of the empty string is $\ell(\lambda) = 0$. In the sequel $A^j$ will
denote the set of all sequences of length $j$ over $A$.

Given two sequences $w = w_m^n$ and $v = v_j^k$, we will denote by $vw$ the sequence
of length $\ell(v) + \ell(w)$ obtained by concatenating the two strings. In particular,
$\lambda w = w \lambda = w$. The concatenation of sequences is also extended to the case in
which $v$ denotes a semi-infinite sequence, that is $v = (\dots, v_{-2}, v_{-1})$, denoted by
$v = v_{-\infty}^{-1}$.

We say that the sequence $s$ is a suffix of the sequence $w$ if there exists a
sequence $u$, with $\ell(u) \ge 1$, such that $w = us$. In this case we write $s \prec w$. When
$s \prec w$ or $s = w$ we write $s \preceq w$.

Definition 2.1. A set $T$ of finite or semi-infinite sequences is a tree if no
sequence $s \in T$ is a suffix of another sequence $w \in T$. This property is called
the suffix property.
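
As an illustration, the suffix property can be checked mechanically for finite trees. The sketch below is not part of the original development: it assumes sequences are represented as Python strings with the most recent symbol written last, and the function names `is_proper_suffix` and `has_suffix_property` are hypothetical.

```python
def is_proper_suffix(s: str, w: str) -> bool:
    """Return True if w = u + s for some non-empty u (s is a suffix of w, s != w)."""
    return len(s) < len(w) and w.endswith(s)


def has_suffix_property(tree: set) -> bool:
    """Return True if no sequence in `tree` is a proper suffix of another one."""
    return not any(is_proper_suffix(s, w) for s in tree for w in tree)


# {"1", "10", "000", "100"} satisfies the suffix property,
# whereas {"1", "01"} does not, because "1" is a suffix of "01".
print(has_suffix_property({"1", "10", "000", "100"}))
print(has_suffix_property({"1", "01"}))
```

Only finite sequences are handled here; semi-infinite sequences would need a different representation.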

We define the height of the tree $T$ as

$$h(T) = \sup\{\ell(w) : w \in T\}.$$

In the case $h(T) < +\infty$ we say that $T$ is bounded and we denote by $|T|$ the
number of sequences in $T$. On the other hand, if $h(T) = +\infty$ we say that the
tree $T$ is unbounded.

Given a tree $T$ and an integer $K$ we will denote by $T|_K$ the tree $T$ truncated
to level $K$, that is

$$T|_K = \{w \in T : \ell(w) \le K\} \cup \{w : \ell(w) = K \text{ and } w \prec u, \text{ for some } u \in T\}.$$

The expression $\mathrm{Int}(T)$ will denote the set of all sequences that are suffixes of
some $u \in T$, that is

$$\mathrm{Int}(T) = \{w : w \prec u, \text{ for some } u \in T\}.$$
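
The truncation and the set of internal sequences can be computed directly for a finite tree. The following is a minimal sketch under the same string convention (most recent symbol last); the helper names `truncate` and `interior` are assumptions made for this illustration only.

```python
def truncate(tree, K):
    """Tree truncated to level K: short sequences are kept, longer ones are cut to their last K symbols."""
    kept = {w for w in tree if len(w) <= K}
    cut = {w[len(w) - K:] for w in tree if len(w) > K}
    return kept | cut


def interior(tree):
    """All proper suffixes (including the empty string) of the sequences in the tree."""
    return {u[i:] for u in tree for i in range(1, len(u) + 1)}


tree = {"1", "10", "000", "100"}
print(sorted(truncate(tree, 2)))   # sequences of length at most 2
print(sorted(interior(tree)))      # internal nodes of the tree
```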

We will say that a tree $T$ is complete if for every semi-infinite sequence
$w_{-\infty}^{-1}$ there exists a sequence $s \in T$ such that $s \preceq w_{-\infty}^{-1}$.

Consider a stationary ergodic stochastic chain $\{X_t : t \in \mathbb{Z}\}$ over $A$. Given a
sequence $w \in A^j$ we denote by

$$p(w) = P(X_1^j = w)$$

the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$
we write

$$p(a|w) = P(X_0 = a \mid X_{-j}^{-1} = w).$$

In the sequel we will use the simpler notation $X_t$ for the process $\{X_t : t \in \mathbb{Z}\}$.
Definition 2.2. A sequence $w \in A^j$ is a context for the process $X_t$ if it satisfies

1. For any semi-infinite sequence $x_{-\infty}^{-1}$ having $w$ as a suffix

$$P(X_0 = a \mid X_{-\infty}^{-1} = x_{-\infty}^{-1}) = p(a|w), \text{ for all } a \in A.$$

2. No suffix of $w$ satisfies (1).

An infinite context is a semi-infinite sequence $w_{-\infty}^{-1}$ such that none of its suffixes
$w_{-j}^{-1}$, $j = 1, 2, \dots$ is a context.

Definition 2.2 implies that the set of all contexts (finite or infinite) satisfies 
the suffix property and hence it is a tree. This tree is called the context tree of 
the process $X_t$ and will be denoted by $T_0$.
Remark 2.3. In this paper we will also consider i.i.d. processes. We will assume
that these processes are compatible with a particular tree, given by the set $\{\lambda\}$.

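Define the sequence $\{\alpha_k\}_{k \in \mathbb{N}}$ as

$$\alpha_0 := \inf_{w \in T_0,\, a \in A} \{p(a|w)\},$$

$$\alpha_k := \inf_{u \in A^k} \sum_{a \in A} \inf_{w \in T_0,\, w \succeq u} \{p(a|w)\}. \tag{2.4}$$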

Assumption 1. From now on we will assume the process $X_t$ satisfies

1. $\alpha_0 > 0$ and

2. $\alpha := \sum_{k \in \mathbb{N}} (1 - \alpha_k) < +\infty$.

The positivity assumption over $\alpha_0$ implies that the context tree of the process
$X_t$ is complete, i.e., any semi-infinite sequence belongs to $T_0$ or has a suffix
that belongs to $T_0$. The second assumption is related to the loss of memory of
a process of infinite order (see Galves and Leonardi (2008) for more details).

In what follows we will assume $x_1, x_2, \dots, x_n$ is a sample of the process $X_t$.
Let $d(n) < n$ be a function taking integer values and growing to infinity with
$n$. This will denote the maximal height of the estimated context trees (and will
be denoted simply by $d$). Then, given a sequence $w$, with $1 \le \ell(w) \le d$, and a
symbol $a \in A$ we denote by $N_n(w,a)$ the number of occurrences of symbol $a$
preceded by the sequence $w$, starting at $d + 1$, that is,

$$N_n(w,a) = \sum_{t=d+1}^{n} \mathbf{1}\{x_{t-\ell(w)}^{t-1} = w,\ x_t = a\}. \tag{2.5}$$

On the other hand, $N_n(w)$ will denote the sum $\sum_{a \in A} N_n(w,a)$.
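
The counting rule can be made concrete with a short sketch. It assumes the sample is a Python string with the most recent symbol last, reads (2.5) as counting, for each position $t = d+1, \dots, n$, the symbol $x_t$ together with the $\ell(w)$ symbols preceding it, and also records the empty context (length 0), which is convenient later. The name `transition_counts` is hypothetical.

```python
from collections import Counter


def transition_counts(x, d):
    """Return a Counter mapping (w, a) to N_n(w, a) for all contexts w of length 0..d."""
    counts = Counter()
    # 0-based index i corresponds to position t = i + 1 in the text, so t runs over d+1..n.
    for i in range(d, len(x)):
        for k in range(d + 1):
            counts[(x[i - k:i], x[i])] += 1
    return counts


def context_count(counts, w, alphabet):
    """N_n(w), the sum of N_n(w, a) over the alphabet."""
    return sum(counts[(w, a)] for a in alphabet)


counts = transition_counts("0110100", d=2)
print(counts[("1", "0")], counts[("1", "1")])
print(context_count(counts, "", "01"))   # equals n - d
```

Starting the count at $d+1$ guarantees that every counted position has $d$ preceding symbols, so the counts of a context equal the sum of the counts of its one-symbol extensions.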

Definition 2.6. We will say that the tree $T$ is feasible if it is complete, $h(T) \le
d$, $N_n(w) \ge 1$ for all $w \in T$ and any string $w'$ with $N_n(w') \ge 1$ either belongs
to $T$, is a suffix of some $w \in T$ or has a suffix $w$ that belongs to $T$.

We will denote by $\mathcal{F}^d(x_1^n)$ the set of all feasible trees. Then, given a tree
$T \in \mathcal{F}^d(x_1^n)$, the maximum likelihood of the sequence $x_1, \dots, x_n$ is given by

$$P_{ML,T}(x_1^n) = \prod_{w \in T} \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}, \tag{2.7}$$

where the empirical probabilities $\hat{p}_n(a|w)$ are given by

$$\hat{p}_n(a|w) = \frac{N_n(w,a)}{N_n(w)}. \tag{2.8}$$

Here and in the sequel we use the convention $0^0 = 1$, for example in the case of
$N_n(w,a) = 0$ in expression (2.7). Note that by Definition 2.6, as $N_n(w) \ge 1$ for
any $w \in T$, it is not necessary to give an extra definition of $\hat{p}_n(a|w)$ in the case
$N_n(w) = 0$.
Given a sequence $w$, with $N_n(w) \ge 1$, we will denote by

$$P_{ML,w}(x_1^n) = \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}.$$

Hence, we have

$$P_{ML,T}(x_1^n) = \prod_{w \in T} P_{ML,w}(x_1^n).$$
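
As a small illustration of (2.7) and (2.8), the log-likelihood of a tree can be computed from the transition counts as a sum over its contexts. This is a sketch only: it assumes string-valued samples with the most recent symbol last, uses the convention $0^0 = 1$ by skipping zero counts, and the function names are hypothetical.

```python
import math
from collections import Counter


def transition_counts(x, d):
    """Counts N_n(w, a) for all contexts w of length 0..d, starting at position d+1."""
    counts = Counter()
    for i in range(d, len(x)):
        for k in range(d + 1):
            counts[(x[i - k:i], x[i])] += 1
    return counts


def log_ml_context(counts, w, alphabet):
    """log P_ML,w: sum over a of N_n(w,a) * log(N_n(w,a) / N_n(w)), skipping zero counts."""
    n_w = sum(counts[(w, a)] for a in alphabet)
    total = 0.0
    for a in alphabet:
        c = counts[(w, a)]
        if c > 0:
            total += c * math.log(c / n_w)
    return total


def log_ml_tree(counts, tree, alphabet):
    """log P_ML,T as the sum of the per-context terms."""
    return sum(log_ml_context(counts, w, alphabet) for w in tree)


counts = transition_counts("0110100", d=2)
print(log_ml_tree(counts, {""}, "01"))        # i.i.d. model
print(log_ml_tree(counts, {"0", "1"}, "01"))  # order-1 model, never smaller than the above
```

Refining a tree can only increase the maximum likelihood, which is why a penalizing term is needed to select among trees.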

Let $f(n)$ be any positive function such that $f(n) \to +\infty$, when $n \to +\infty$, and
$n^{-1} f(n) \to 0$, when $n \to +\infty$. This function will represent the generic penalizing
term of our estimator, replacing the function $\frac{|A|-1}{2} \log n$ in the classical definition
of BIC (Csiszár and Talata; 2006). A function satisfying these conditions will
be called penalizing term.

Definition 2.9. Given a penalizing term $f(n)$, the PL context tree estimator
is given by

$$\hat{T}(x_1^n) = \arg\min_{T \in \mathcal{F}^d(x_1^n)} \{ -\log P_{ML,T}(x_1^n) + |T| f(n) \}. \tag{2.10}$$

As can be seen, the computation of the estimated context tree using its raw 
definition would imply a search for the optimal tree on the set of all feasible 
trees. This was the biggest drawback of this approach, because the size of this 
set grows extremely fast as a function of the maximal height d. Fortunately, 
there is a way of computing the PL estimator without exploring the set of all 
trees, as shown by Csiszár and Talata (2006). The details of this algorithm are
given in the Appendix and will be used in the proof of our main result. 

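Let $K \in \mathbb{N}$. Define the underestimation event with respect to the truncated
tree $T_0|_K$ by

$$U_n^K = \bigcup_{w \in \mathrm{Int}(T_0|_K)} \{w \in \hat{T}(x_1^n)\}$$

and the overestimation event by

$$O_n^K = \bigcup_{w \succ v \in T_0,\ \ell(v) < K} \{w \in \hat{T}(x_1^n)\}.$$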

We are ready to present the main result in this paper. It establishes upper 
bounds for the probability of occurrence of the underestimation and overesti- 
mation events. 

Theorem 2.11. Let $x_1, x_2, \dots$ be a sample of the stationary ergodic stochastic
process $X_t$ having context tree $T_0$ and satisfying Assumption 1. For any
constant $K \in \mathbb{N}$ there exist an integer $n_0$ and positive constants $c_1$, $c_2$, $c_3$ and $c_4$
depending on the process $X_t$ such that for any $n \ge n_0$

(a) $P[U_n^K] \le c_1 e^{-c_2 (n-d)}$;

(b) P[0,f] < c3|A|''e-=*^(")("o/l^l)*'/d. 
Corollary 2.12. For any penalizing term $f(n)$ and any function $d(n)$ such that
for any constant $c > 0$,

5:|Ar")exp[-:^(^]<+oo (2.13) 

we have that there exists an integer $n_0$ depending on the process $X_t$ such that
$\hat{T}(x_1^n)|_K = T_0|_K$ for any $n \ge n_0$.

3. Proof of Theorem 2.11 

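Using Definition 5.5 and Lemma 5.7 we see that the tree in (2.10) can be
written as

$$\hat{T}(x_1^n) = \{w \in \textstyle\bigcup_{j=1}^{d} A^j : \mathcal{X}_w(x_1^n) = 0,\ \mathcal{X}_v(x_1^n) = 1 \text{ for all } v \prec w\}$$

if $\mathcal{X}_\lambda(x_1^n) = 1$, and is equal to $\{\lambda\}$ if $\mathcal{X}_\lambda(x_1^n) = 0$. Then, for $n$ sufficiently large in order
to guarantee that $T_0|_K$ will be in $\mathcal{F}^d(x_1^n)$ we have that

$$U_n^K \subset \bigcup_{w \in \mathrm{Int}(T_0|_K)} \{\mathcal{X}_w(x_1^n) = 0\}$$

and

$$O_n^K \subset \bigcup_{v \in T_0,\ \ell(v) < K} \{\mathcal{X}_v(x_1^n) = 1\}.$$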

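To prove (a) let $w \in \mathrm{Int}(T_0|_K)$, then using Definition 5.4 and Lemma 5.6 we
have that

$$P[\mathcal{X}_w(x_1^n) = 0] = P\Big[ \prod_{a \in A} V_{aw}(x_1^n) \le e^{-f(n)} P_{ML,w}(x_1^n) \Big]$$

and for any $a \in A$

$$V_{aw}(x_1^n) = \max_{T \in \mathcal{F}^d_{aw}(x_1^n)} \prod_{u \in T} e^{-f(n)} P_{ML,u}(x_1^n),$$

where $\mathcal{F}^d_{aw}(x_1^n)$ is the set containing all trees $T$ that have the form $T = T' \cap
\{u : u \succeq aw\}$, with $T' \in \mathcal{F}^d(x_1^n)$. Then

$$P[\mathcal{X}_w(x_1^n) = 0] = P\Big[ \max_{T \in \mathcal{F}^d_w(x_1^n),\ T \ne \{w\}} \prod_{u \in T} e^{-f(n)} P_{ML,u}(x_1^n) \le e^{-f(n)} P_{ML,w}(x_1^n) \Big].$$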

For a tree $T \in \mathcal{F}^d_w(x_1^n)$ define the quantity

$$\delta_T(w) = \sum_{a \in A} \Big[ \sum_{u \in T} p(ua) \log p(a|u) - p(wa) \log p(a|w) \Big]. \tag{3.1}$$

Using Jensen's inequality we can see that $\delta_T(w) > 0$ unless $p(a|w) = p(a|u)$
for all $a \in A$ and all $u \in T$. Therefore, for a sufficiently large $n$ there must be
a tree $T_w \in \mathcal{F}^d_w(x_1^n)$ such that $\delta_{T_w}(w) > 0$; if not we contradict the fact that
$w \in \mathrm{Int}(T_0)$ and it is not a context in the sense of Definition 2.2. Therefore

$$P[\mathcal{X}_w(x_1^n) = 0] \le P\Big[ \prod_{u \in T_w} e^{-f(n)} P_{ML,u}(x_1^n) \le e^{-f(n)} P_{ML,w}(x_1^n) \Big].$$

Now we can apply the logarithm function on both sides inside the probability
obtaining that the right hand side equals

$$P\Big[ \sum_{u \in T_w} \log P_{ML,u}(x_1^n) - \log P_{ML,w}(x_1^n) \le (|T_w| - 1) f(n) \Big].$$

Dividing by $n - d$ and subtracting on both sides the term $\delta_{T_w}(w)$ we have that
for a sufficiently large $n$ such that

n-d 2 
we can bound above the last expression by 

where for any finite sequence $s$

$$L_n(s) = \sum_{a \in A} \Big[ p(sa) \log p(a|s) - \frac{N_n(s,a)}{n-d} \log \hat{p}_n(a|s) \Big].$$

Using Corollary 5.9 we can bound above this expression by 

' "''^ 1024e|A|3(a + ao)log2ao/i(T^) ^ 

We conclude the proof of part (a) by observing that we only have a finite number
of sequences $w \in \mathrm{Int}(T_0|_K)$, so we can take

ci= max {3e^|A|2(l + |r^|)} 

w£lnt{To\K) 

and 

■ tx X2 \ 2(/i(r^)+i) 
mm(tfT^,5f, )"o ' ^ ' 

C2 = mm \ ^ \ . 

weint{To\K) 1024e| A|3(a + ao) log^ aoh{T^) 

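To prove part (b) observe that for any $w \in T_0$ with $\ell(w) < K$

$$P[\mathcal{X}_w(x_1^n) = 1] = P\Big[ \prod_{a \in A} V_{aw}(x_1^n) > e^{-f(n)} P_{ML,w}(x_1^n) \Big]. \tag{3.2}$$

Using Lemma 5.6 we have that, on the event $\{\mathcal{X}_w(x_1^n) = 1\}$,

$$\prod_{a \in A} V_{aw}(x_1^n) = \prod_{u \in T_w(x_1^n)} e^{-f(n)} P_{ML,u}(x_1^n).$$

Then, applying the logarithm function the probability (3.2) is equal to

$$P\Big[ \sum_{u \in T_w(x_1^n)} \log e^{-f(n)} P_{ML,u}(x_1^n) > \log e^{-f(n)} P_{ML,w}(x_1^n) \Big] \tag{3.3}$$

$$= P\Big[ \log P_{ML,w}(x_1^n) - \sum_{u \in T_w(x_1^n)} \log P_{ML,u}(x_1^n) < (1 - |T_w(x_1^n)|) f(n) \Big].$$

We know, by the maximum likelihood estimator of the transition probabilities
that

$$P_{ML,w}(x_1^n) \ge \prod_{a \in A} p(a|w)^{N_n(w,a)}. \tag{3.4}$$

Therefore, we can bound above the right hand side of (3.3) by

$$P\Big[ \sum_{a \in A} N_n(w,a) \log p(a|w) - \sum_{u \in T_w(x_1^n)} \log P_{ML,u}(x_1^n) < (1 - |T_w(x_1^n)|) f(n) \Big]$$

$$= P\Big[ \sum_{a \in A} \sum_{u \in T_w(x_1^n)} N_n(u,a) \log \frac{p(a|u)}{\hat{p}_n(a|u)} < (1 - |T_w(x_1^n)|) f(n) \Big].$$

This equality follows by substituting $N_n(w,a)$ by $\sum_{u \in T_w(x_1^n)} N_n(u,a)$ and using the
fact that $p(a|u) = p(a|w)$ for all $u \in T_w(x_1^n)$, remembering that $w \in T_0$. Observe
that

$$\sum_{a \in A} \sum_{u \in T_w(x_1^n)} N_n(u,a) \log \frac{p(a|u)}{\hat{p}_n(a|u)} = - \sum_{u \in T_w(x_1^n)} N_n(u)\, D(\hat{p}_n(\cdot|u) \,\|\, p(\cdot|u)),$$

where $D$ is the Kullback-Leibler divergence between the two distributions $\hat{p}_n(\cdot|u)$
and $p(\cdot|u)$ (see the Appendix). Using Lemma 5.2 and dividing by $n - d$ we have
that

$$P\Big[ - \sum_{u \in T_w(x_1^n)} N_n(u)\, D(\hat{p}_n(\cdot|u) \,\|\, p(\cdot|u)) < (1 - |T_w(x_1^n)|) f(n) \Big]$$

$$\le P\Big[ - \sum_{u \in T_w(x_1^n)} \sum_{a \in A} \frac{N_n(u)}{n-d} \frac{[\hat{p}_n(a|u) - p(a|u)]^2}{p(a|u)} < \frac{(1 - |T_w(x_1^n)|) f(n)}{n-d} \Big].$$

As $\mathcal{X}_w(x_1^n) = 1$ it follows that $|T_w(x_1^n)| > 1$. On the other hand, $N_n(u) \le n - d$
and $f(n) > 0$. Therefore, we can bound above the right hand side of the last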
expression by

f{n)p{a\u) 



TP[\Pn{a\u) - p{a\u)\ > 



{n-d)\A\\T^{x^l)\ 



Hence, using Corollary 5.9 we can bound above this expression by 
2e* \Af+'^ exp[ 



r, \ 2(d+l) 



32e(a + ao)|^|''+^d 
This finishes the proof of Theorem 2.11, by taking 

cg = 2e~|Ap and C4 



32e(a + ao)|^P' 

Proof of Corollary 2.12. It follows from the Borel-Cantelli Lemma and
Theorem 2.11, by noting that

$$P[\hat{T}(x_1^n)|_K \ne T_0|_K] \le P[U_n^K] + P[O_n^K]$$

and the right hand side is summable in $n$ when condition (2.13) is satisfied.



4. Final Remarks 

The present paper presents upper bounds for the rate of convergence of
penalized likelihood context tree estimators. We obtain an exponential bound for
the underestimation event and a sub-exponential bound for the
overestimation event. These results generalize the previous work by Dorea and
Zhao (2006), who obtained similar bounds in the case of the estimation of the
order of a Markov chain, also using penalized likelihood criteria. One question
that still remains open is whether these bounds are optimal, as in the case of an
estimator introduced in Finesso et al. (1996) for the estimation of the order of
a Markov chain. They prove that in the case of their estimator, the constant
appearing in the underestimation bound is optimal, and that the overestimation
bound cannot be exponential if the estimator is universal, as in our case.
Answering these questions is an important subject for future work in this area.



5. Appendix 

5.1. The context tree maximizing principle 

The following definitions and results were taken from Csiszár and Talata (2006)
and were included for completeness. Definitions 5.4 and 5.5 and Lemmas 5.6
and 5.7 were originally proven for the usual penalizing term $f(n) = \frac{|A|-1}{2} \log n$,
but can be adapted in a straightforward way to our setting.
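Given two probability distributions $p$ and $q$ over $A$, the Kullback-Leibler
divergence is defined by

$$D(p \| q) = \sum_{a \in A} p(a) \log \frac{p(a)}{q(a)}, \tag{5.1}$$

where, by convention, $p(a) \log \frac{p(a)}{q(a)}$ equals $0$ if $p(a) = 0$ and $+\infty$ if $p(a) > q(a) =
0$.

Lemma 5.2. If $p$ and $q$ are two probability distributions over $A$ then

$$D(p \| q) \le \sum_{a \in A} \frac{[p(a) - q(a)]^2}{q(a)}. \tag{5.3}$$

Proof. See Csiszár and Talata (2006, Lemma 6.3). □

Consider the full tree $A^d$, and let $S^d$ denote the set of all sequences of length
at most $d$, that is $S^d = \bigcup_{j=0}^{d} A^j$.

Definition 5.4. Given a sequence $w \in S^d$ with $N_n(w) \ge 1$, we define
recursively, starting from the sequences of the full tree $A^d$, the value

$$V_w(x_1^n) = \begin{cases} \max\{ e^{-f(n)} P_{ML,w}(x_1^n),\ \prod_{a \in A} V_{aw}(x_1^n) \}, & \text{if } 0 \le \ell(w) < d, \\ e^{-f(n)} P_{ML,w}(x_1^n), & \text{if } \ell(w) = d, \end{cases}$$

and the indicator

$$\mathcal{X}_w(x_1^n) = \begin{cases} 1, & \text{if } 0 \le \ell(w) < d \text{ and } \prod_{a \in A} V_{aw}(x_1^n) > e^{-f(n)} P_{ML,w}(x_1^n), \\ 0, & \text{if } 0 \le \ell(w) < d \text{ and } \prod_{a \in A} V_{aw}(x_1^n) \le e^{-f(n)} P_{ML,w}(x_1^n), \\ 0, & \text{if } \ell(w) = d. \end{cases}$$

Definition 5.5. Given $w \in S^d$ with $N_n(w) \ge 1$, the maximizing tree assigned to
the sequence $w$ is the tree

$$T_w(x_1^n) = \{u \in S^d : \mathcal{X}_u(x_1^n) = 0,\ \mathcal{X}_v(x_1^n) = 1 \text{ for all } w \preceq v \prec u\}$$

if $\mathcal{X}_w(x_1^n) = 1$ and $T_w(x_1^n) = \{w\}$ if $\mathcal{X}_w(x_1^n) = 0$.

For a sequence $w \in S^d$, with $N_n(w) \ge 1$, define $\mathcal{F}^d_w(x_1^n)$ as the set containing
all trees $T$ that have the form $T = T' \cap \{u : u \succeq w\}$, with $T' \in \mathcal{F}^d(x_1^n)$.

Lemma 5.6. For any $w \in S^d$ with $N_n(w) \ge 1$,

$$V_w(x_1^n) = \max_{T \in \mathcal{F}^d_w(x_1^n)} \prod_{u \in T} e^{-f(n)} P_{ML,u}(x_1^n) = \prod_{u \in T_w(x_1^n)} e^{-f(n)} P_{ML,u}(x_1^n).$$

Proof. See Csiszár and Talata (2006, Lemma 4.4). □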
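
The inequality of Lemma 5.2 can be checked numerically. The sketch below assumes the lemma reads $D(p\|q) \le \sum_{a \in A} (p(a) - q(a))^2 / q(a)$, which is the form used when the divergence is bounded in the proof of Theorem 2.11(b); the function names are hypothetical and distributions are passed as lists of probabilities over a common alphabet.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) with the conventions 0 log 0 = 0 and +inf if p(a) > q(a) = 0."""
    total = 0.0
    for pa, qa in zip(p, q):
        if pa == 0:
            continue
        if qa == 0:
            return math.inf
        total += pa * math.log(pa / qa)
    return total


def chi_square_sum(p, q):
    """Sum over a of (p(a) - q(a))^2 / q(a); assumes q(a) > 0 for every a."""
    return sum((pa - qa) ** 2 / qa for pa, qa in zip(p, q))


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), chi_square_sum(p, q))
```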
Lemma 5.7. The context tree estimator $\hat{T}(x_1^n)$ in (2.10) equals the maximizing
tree assigned to the empty string $\lambda$, that is,

$$\hat{T}(x_1^n) = T_\lambda(x_1^n).$$

Proof. See Csiszár and Talata (2006, Proposition 4.3). □

From this result it follows that in order to obtain the tree maximizing the
penalized maximum likelihood criteria it is sufficient to assign to each sequence
$w \in S^d$, with $N_n(w) \ge 1$, the indicator $\mathcal{X}_w(x_1^n)$ and then to get the maximizing
tree $T_\lambda(x_1^n)$. The computational cost of this algorithm is linear in $n$ if $d(n) =
o(n)$, as proven by Csiszár and Talata (2006).
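
The recursion described above can be sketched in a few lines. This is an illustrative sketch, not the linear-time implementation of Csiszár and Talata: it works with logarithms of the values, assumes the value of a node is the larger of its own penalized likelihood and the product of the values of its children, skips children that never occur in the sample (an assumption of this sketch), and recomputes counts naively. The function names and the parameter `penalty` (standing for $f(n)$) are hypothetical.

```python
import math
from collections import Counter


def transition_counts(x, d):
    """Counts N_n(w, a) for all contexts w of length 0..d, starting at position d+1."""
    counts = Counter()
    for i in range(d, len(x)):
        for k in range(d + 1):
            counts[(x[i - k:i], x[i])] += 1
    return counts


def pl_context_tree(x, alphabet, d, penalty):
    """Return the maximizing tree assigned to the empty string, as a set of strings."""
    counts = transition_counts(x, d)

    def n_w(w):
        return sum(counts[(w, a)] for a in alphabet)

    def log_ml(w):
        total = n_w(w)
        return sum(counts[(w, a)] * math.log(counts[(w, a)] / total)
                   for a in alphabet if counts[(w, a)] > 0)

    indicator = {}

    def log_value(w):
        """Log of the value of node w; also records the indicator of w."""
        own = -penalty + log_ml(w)
        children = [a + w for a in alphabet if n_w(a + w) >= 1] if len(w) < d else []
        if children:
            split = sum(log_value(c) for c in children)
            if split > own:          # strict inequality: ties keep the smaller tree
                indicator[w] = 1
                return split
        indicator[w] = 0
        return own

    log_value("")

    def leaves(w):
        """Collect the sequences with indicator 0 whose ancestors all have indicator 1."""
        if indicator[w] == 0:
            return {w}
        found = set()
        for a in alphabet:
            if a + w in indicator:
                found |= leaves(a + w)
        return found

    return leaves("")


sample = "01" * 50   # each symbol is determined by the previous one
print(sorted(pl_context_tree(sample, "01", d=2, penalty=0.5 * math.log(len(sample)))))
```

Each node is visited once, so the selection step avoids enumerating the set of all feasible trees; only the counting step above is done naively.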



5.2. Exponential inequalities for empirical probabilities 

The following result was proven in Galves and Leonardi (2008); we omit its
proof here.

Theorem 5.8. Assume the process $X_t$ satisfies Assumption 1. Then for any
finite sequence $w$, any symbol $a \in A$ and any $t > 0$ the following inequality holds

$$P(|N_n(w,a) - (n-d) p(wa)| > t) \le e^{1/e} \exp\Big[ \frac{-t^2 C}{(n-d)\,\ell(wa)} \Big],$$
where 

C 



ie{a + ao) 

As a consequence of Theorem 5.8 we obtain the following corollary. 

Corollary 5.9. For any finite sequence $w$, with $p(w) > 0$, any $t > 0$ and any
sufficiently large $n$ such that $N_n(w) \ge 1$ we have

(a) maxaeAP(|p„(a|u') ~ p{a\w)\ > t) < 2ei \A\ exp [- g^g^lgfgj^] ; 

(b) p[ \L„M\ >t]<3ei \A\^ exp[- ,,^^;:g:ff:;g^) ] , 

where $L_n(w) = \sum_{a \in A} \big[ p(wa) \log p(a|w) - \frac{N_n(w,a)}{n-d} \log \hat{p}_n(a|w) \big]$.
Proof. To prove (a) observe that

$$|\hat{p}_n(a|w) - p(a|w)| = \Big| \frac{N_n(w,a)}{N_n(w)} - \frac{(n-d) p(wa)}{(n-d) p(w)} \Big|.$$

Then, summing and subtracting the term $\frac{N_n(w,a)}{(n-d) p(w)}$ we obtain

$$|\hat{p}_n(a|w) - p(a|w)| \le \frac{N_n(w,a)}{N_n(w) (n-d) p(w)} |(n-d) p(w) - N_n(w)| + \frac{1}{(n-d) p(w)} |N_n(w,a) - (n-d) p(wa)|.$$

Therefore, as $N_n(w,a)/N_n(w) \le 1$ we have

$$P(|\hat{p}_n(a|w) - p(a|w)| > t) \le P\Big( |(n-d) p(w) - N_n(w)| > \frac{t (n-d) p(w)}{2} \Big) + P\Big( |N_n(w,a) - (n-d) p(wa)| > \frac{t (n-d) p(w)}{2} \Big).$$

We can write $N_n(w) = \sum_{b \in A} N_n(w,b)$ and $p(w) = \sum_{b \in A} p(wb)$, then the right
hand side of the last expression can be bounded above by the sum

$$\sum_{b \in A} P\Big( |N_n(w,b) - (n-d) p(wb)| > \frac{t (n-d) p(w)}{2|A|} \Big) + P\Big( |N_n(w,a) - (n-d) p(wa)| > \frac{t (n-d) p(w)}{2} \Big).$$

Using Theorem 5.8 we can bound above this expression by

$$e^{1/e} (|A| + 1) \exp\Big[ - \frac{(n-d)\, t^2 p(w)^2 C}{4 |A|^2 \ell(wa)} \Big].$$

This finishes the proof of (a). To prove (b) observe that



»[|L„H| > t] <F[\J2^ogp{a\w){piwa)- 



a£A 



Nn{w,a) . I ■ 
n-d 2 



Using Theorem 5.8 we have that 
F[\J2^ogp{a\w){piwa)-^^^)\ > ^ 

aGA 



< '^P[\Nn{w,a)-{n-d)p{wa)\ > '^''^ 



aeA 



2 \logp{a\w)\\A\ 



i I ^1 r - d)t'^C -, /r 
< e= A exp ^ ^ . (5.10) 

On the other hand, using the definition of the Kullback-Leibler divergence,
Lemma 5.2 and part (a) of this Corollary we obtain

m,r iV^ Nn(w,a) , P(a\w) I t-, ^r^/-/iMi/ixx ti 

P[lE^^iog|U-;^| > -] < F[ WHIb(-N) > 2] 



aeA 



- J2^[ -Pn{a\w)\ > 



a£A 



I tp{a\w) 
2\A\ 



< 2ei |Ap exp[ ^""f^^^^'^'f 1. (5.11) 

' ' 64e|A|3(a + ao)^(wa)J ^ ' 

Summing (5.10) and (5.11) we obtain the bound in part (b) and we conclude 
the proof of Corollary 5.9. □ 